“Celda” stands for “CEllular Latent Dirichlet Allocation”, which is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. Celda has advantages over other clustering frameworks:
In this vignette we will demonstrate how to perform cell and gene clustering with simulated and real single cell RNA-seq data using the Bayesian hierarchical models within Celda.
Currently Celda is hosted on a Github repository compbiomed/celda.
library(devtools)
source("https://bioconductor.org/biocLite.R")
install.packages("Rcpp")
biocLite("SummarizedExperiment")
biocLite("MAST")
install_github("compbiomed/celda")
This should automatically install Celda on your local machine as long as you have proper internet connection. If you encounter other problems during the installation you may want to open up a browser and access the issues page on our github repository at https://github.com/compbiomed/celda. If you have any other questions or concerns, you could also contact our Google Group at: https://groups.google.com/forum/#!forum/celda-list.
Once installed the package can be loaded using the library command.
library(celda)
Complete list of help files are accessible using the help command with a package option.
help(package=celda)
We will learn how to use the package Celda to cluster genes and cells from count data generated in single cell RNA-seq experiments. Celda will take a matrix of counts where each row is a gene and each cell is a column and each entry in the matrix is the number of counts for each gene in each cell. For this tutorial, we will start with a simple matrix of counts that were simulated according to the generative process behind the celda_CG model.
We use the built-in data generating function simulateCells to simulate a single-cell RNA-Seq dataset. simulateCells returns a list containing a count matrix where each row is a gene and each column is a cell. The K parameter determines the number of cell clusters while the L parameter determines the number of gene clusters within the dataset. The S parameter determines the number of samples.
sim_counts <- simulateCells("celda_CG", K = 6, L = 6, S = 10)
The counts variable contains the counts matrix.
The dimensions of counts matrix:
dim(sim_counts$counts)
## [1] 1000 822
The z variable contains the cluster labels for each cell.
Here is the number of cells in each subpopulation:
table(sim_counts$z)
##
## 1 2 3 4 5 6
## 125 103 141 95 147 211
The y variable contains the transcriptional state assignment for each gene.
Here is the number of genes in each transcriptional state:
table(sim_counts$y)
##
## 1 2 3 4 5 6
## 50 27 201 374 107 241
The sample.label is used to denote the sample from which each cell was derived. In this simulated dataset, we have 10 samples:
table(sim_counts$sample.label)
##
## Sample_1 Sample_10 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6
## 86 100 94 88 95 73 58
## Sample_7 Sample_8 Sample_9
## 66 75 87
Each cell is assumed to come from a sample. Here is the number of cells in each subpopulation within each sample:
table(sim_counts$z, sim_counts$sample.label)
##
## Sample_1 Sample_10 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## 1 25 1 2 14 11 11 12 19
## 2 5 1 9 6 19 5 7 20
## 3 3 54 2 21 10 9 14 11
## 4 3 1 32 15 11 5 2 4
## 5 25 0 2 30 31 40 5 8
## 6 25 43 47 2 13 3 18 4
##
## Sample_8 Sample_9
## 1 13 17
## 2 19 12
## 3 15 2
## 4 14 8
## 5 6 0
## 6 8 48
There are currently three models within this package: celda_C will cluster cells, celda_G will cluster genes, and celda_CG will simultaneously cluster cells and genes. The celda function will use these models and take in a counts matrix to create a celda.list object that contains the cluster labels for each cell and gene.
celda.res <- celda(counts = sim_counts$counts, model = "celda_CG", K = 6, L = 6, max.iter = 10, cores = 1, nchains = 1)
The celda.list contains 4 objects: “run.params”, “res.list”,“content.type”,and “count.checksum”. The Celda model(s) will be stored within the “res.list”.
Here is a comparison between the true cluster labels and the estimated cluster labels.
names(celda.res)
## [1] "run.params" "res.list" "content.type" "count.checksum"
model <- celda.res$res.list[[1]]
z <- model$z
y <- model$y
table(z, sim_counts$z)
##
## z 1 2 3 4 5 6
## 1 125 0 0 0 0 0
## 2 0 103 0 0 0 0
## 3 0 0 141 0 0 0
## 4 0 0 0 95 0 0
## 5 0 0 0 0 0 211
## 6 0 0 0 0 147 0
table(y, sim_counts$y)
##
## y 1 2 3 4 5 6
## 1 49 0 0 0 0 0
## 2 0 27 0 0 0 0
## 3 0 0 200 1 0 0
## 4 1 0 1 373 0 1
## 5 0 0 0 0 107 0
## 6 0 0 0 0 0 240
We can display the clustering results with a heatmap of the normalized counts, obtained by the function normalizeCounts, which normalizes a counts matrix by a scalar factor.
norm.counts <- normalizeCounts(sim_counts$counts, scale.factor = 1e6)
renderCeldaHeatmap(counts = norm.counts, z=z, y=y, normalize = NULL,
color_scheme = "divergent",cluster_gene = TRUE, cluster_cell = TRUE)
Celda can also perform matrix factorization to summarize the contribution of each transcriptional state to each cellular subpopulation. Matrix factorization decomposes the counts matrix into multiple matrices, using the given Celda model. Each of these following matrices can be viewed as raw counts values, proportional values, or posterior probability values.
factorized <- factorizeMatrix(model, sim_counts$count)
names(factorized)
## [1] "counts" "proportions" "posterior"
The gene.states contains each gene’s contribution to the transcriptional state it belongs. For our example, the matrix contains 1000 genes to 6 transcriptional state.
dim(factorized$proportions$gene.states)
## [1] 1000 6
head(factorized$proportions$gene.states)
## L1 L2 L3 L4 L5 L6
## Gene_1 0 0 0.00000000 0.000000000 0.002372693 0
## Gene_2 0 0 0.00000000 0.000000000 0.008278966 0
## Gene_3 0 0 0.02276603 0.000000000 0.000000000 0
## Gene_4 0 0 0.00000000 0.002711210 0.000000000 0
## Gene_5 0 0 0.00000000 0.004446981 0.000000000 0
## Gene_6 0 0 0.00000000 0.002785753 0.000000000 0
The cell.states contains each sample’s contribution to each transcriptional state. Here, there are 6 transcriptional states to 822 cells.
dim(factorized$proportions$cell.states)
## [1] 6 822
factorized$proportions$cell.states[,1:8]
## Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6
## L1 0.193893130 0.08523207 0.193387562 0.188118812 0.139823009 0.168926773
## L2 0.206106870 0.04050633 0.193125164 0.201980198 0.009734513 0.192173576
## L3 0.387786260 0.01687764 0.399632642 0.378217822 0.090265487 0.420767145
## L4 0.053435115 0.02784810 0.057727631 0.057425743 0.293805310 0.040294460
## L5 0.004580153 0.26160338 0.004198373 0.003960396 0.085840708 0.001549787
## L6 0.154198473 0.56793249 0.151928628 0.170297030 0.380530973 0.176288260
## Cell_7 Cell_8
## L1 0.0364931 0.134573778
## L2 0.1695594 0.009519688
## L3 0.1379617 0.079186499
## L4 0.3311081 0.284725227
## L5 0.1014686 0.101254868
## L6 0.2234090 0.390739939
The population.states contains each transcriptional states’ contribution to each of the cell states. Since K and L were set to be 6, there are 6 cell supopulations to 6 transcriptional states.
pop.states <- factorized$proportions$population.states
dim(pop.states)
## [1] 6 6
pop.states
## K1 K2 K3 K4 K5 K6
## L1 0.191743984 0.15050892 0.08830247 0.085650270 0.13057352 0.03568253
## L2 0.194996281 0.19499099 0.04624651 0.006115229 0.01112842 0.16329832
## L3 0.405224153 0.43362699 0.01274006 0.161694744 0.08475810 0.13971169
## L4 0.047080010 0.16058868 0.03708449 0.269110512 0.29805329 0.32843685
## L5 0.003092693 0.03386761 0.24896449 0.318763477 0.08426042 0.10126270
## L6 0.157862880 0.02641680 0.56666199 0.158665768 0.39122626 0.23160791
Celda contains several plotting tools for data dimensionality reduction tool outputs. The PCA result for the factorized matrix is plotted by the plotDrCluster and plotDrState function based off of the Celda clusters and state probabilities for each cell respectively:
data.pca <- prcomp(t(scale(t(factorized$proportions$cell.states))),scale = F, center = F)
plotDrCluster(dim1 = data.pca$rotation[,1], dim2 = data.pca$rotation[,2],
cluster = celda.res$res.list[[1]]$z)
plotDrState(dim1 = data.pca$rotation[,1], dim2 = data.pca$rotation[,2],
matrix = factorized$proportions$cell.states, rescale = TRUE)
In addition to the counts matrix, the transcriptional and cell states of a Celda model can also be visualized with a heatmap, which may be useful in determining which cell states highly express a particular transcriptional state. It is recommended to z-score the matrix with the scale parameter for better visualization.
renderProbabilityHeatmap(model = model, counts = sim_counts$counts,
relative = FALSE, scale = TRUE)
Sometimes it may be helpful to visualize the relative probability of each transcriptional state in each cellular subpopulation. We can do this by using the relative parameter:
renderProbabilityHeatmap(model = model, counts = sim_counts$counts,
relative = TRUE, scale = TRUE)
We will now apply Celda on the “pbmc_data” dataset, which is a single-cell RNA-seq dataset of 3000 Peripheral Blood Mononuclear Cells (PBMC), available from 10X Genomics. The rows are organized by gene names, while the columns are organized by barcodes. For this tutorial, the dataset has been slightly modified so the rownames are comprised of the Ensembl gene ID as well as the gene name. The raw dataset can be found at https://support.10xgenomics.com/single-cell/software/pipelines/latest/rkit.
To reduce computational time, we will only include genes with at least 4 counts in a least 4 cells for this analysis, which we have done already for you. However, Celda is capable of analyzing genes with fewer counts.
##Original pbmc_data: 7090 genes, 2700 cells
#pbmc_select <- pbmc_data[rowSums(pbmc_data>3) > 3,]
dim(pbmc_select)
## [1] 1321 2700
head(rownames(pbmc_select))
## [1] "ENSG00000000938_FGR" "ENSG00000002549_LAP3"
## [3] "ENSG00000002586_CD99" "ENSG00000003056_M6PR"
## [5] "ENSG00000004059_ARF5" "ENSG00000004779_NDUFAB1"
head(colnames(pbmc_select))
## [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
## [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"
The optimal number of K and L will likely not be known a priori. Therefore multiple choices of K and L may need to tested and compared. Additionally, Celda is sensitive to initial start conditions, so it is good practice to run multiple chains for each combination of K and L to increase the chances of finding the most optimal solution. Celda is able to run multiple combinations of K and L with multiple chains for each combination in parallel.
pbmc_res <- celda(pbmc_select, K = seq(5:50,by = 5), L = seq(10:50,by = 5),
cores = 1, model = "celda_CG", nchains = 4, max.iter = 100)
The Celda package contains several methods to determine the optimal number of K upon creating the models. calculatePerplexityWithResampling calculates the perplexity for every model created, which provide a distribution of perplexities which can be visualized to determine which model to use for downstream analysis.
calc.perplexity <- calculatePerplexityWithResampling(pbmc_res,
counts = pbmc_select,
resample = TRUE)
visualizePerplexityByKL(calc.perplexity$perplexity.info)
The function gettingClusters attempts to identify the optimal number for K by looking for differential expression between each cell clusters in a single model. The data can be trimmed (i.e, all values higher than 50 are trimmed to 50) for easier visualization.
cluster.plot <- gettingClusters(celda.list = pbmc_res, matrix = pbmc_select, iterations = 3)
cluster.plot$data$negative_log[cluster.plot$data$negative_log > 50] <- 50
cluster.plot
In this example, it appears that K = 15 at L = 50 gives the best model.
We can use the function getBestModel to select the best model with the highest log-likelihood from the list of models once we have selected the optimal K and L. The replicate with the highest log likelihood will be returned from all models that were run with the selected K and L.
model.pbmc <- getBestModel(pbmc_res, K = 15, L = 50)
renderProbabilityHeatmap, as discussed above visualizes the relationship between each transcriptional state and cell subpopulation in a heatmap form.
renderProbabilityHeatmap(counts = pbmc_select, model = model.pbmc,
relative = TRUE, scale = TRUE)
The GiniPlot function utilizes the Gini coefficient, a statistical dispersion measure, to determine the significance of each transcriptional states. The plot will give users a rough estimate of which transcriptional states to look for first.
gini <- GiniPlot(counts = pbmc_select, celda.mod = model.pbmc)
gini
Celda employs the MAST package (McDavid A, 2018) for differential expression analysis of single-cell data. The diffExpBetweenCellStates function determines genes that are differentially expressed in one cell subpopulation against all other subpopulations, or between two cell subpopulations.
If you wish to compare one cell subpopulation compared to all others, leave c2 as NULL:
diff.exp.clust1 <- diffExpBetweenCellStates(counts = pbmc_select,
celda.mod = model.pbmc, c1 = 1, c2 = NULL)
diff.exp.clust1
## Gene Pr(>Chisq) log2fc fdr
## 1: ENSG00000163736_PPBP 5.959488e-41 13.71396 1.574497e-38
## 2: ENSG00000163737_PF4 4.966373e-32 12.72657 3.644766e-30
## 3: ENSG00000127920_GNG11 7.455046e-39 12.43603 1.633866e-36
## 4: ENSG00000168497_SDPR 4.863507e-38 12.42767 7.138548e-36
## 5: ENSG00000180573_HIST1H2AC 9.687060e-39 11.41711 1.633866e-36
## ---
## 1317: ENSG00000254505_CHMP4A 2.705031e-02 NaN 9.554400e-02
## 1318: ENSG00000254709_IGLL5 3.882345e-01 NaN 5.749526e-01
## 1319: ENSG00000254772_EEF1G 4.257560e-02 NaN 1.339104e-01
## 1320: ENSG00000262814_MRPL12 2.701103e-01 NaN 4.473686e-01
## 1321: ENSG00000267100_ILF3-AS1 1.230727e-01 NaN 2.630730e-01
If you wish to compare two cell subpopulations, use both c1 and c2 parameters.
diff.exp.clust1vs2 <- diffExpBetweenCellStates(counts = pbmc_select, celda.mod = model.pbmc, c1 = 6, c2 = 7)
diff.exp.clust1vs2
## Gene Pr(>Chisq) log2fc fdr
## 1: ENSG00000167286_CD3D 7.855456e-32 -7.154195 3.459019e-29
## 2: ENSG00000158869_FCER1G 1.038059e-26 6.283397 1.958965e-24
## 3: ENSG00000100453_GZMB 2.992048e-44 6.128375 3.952495e-41
## 4: ENSG00000011600_TYROBP 1.094117e-24 6.095883 1.192391e-22
## 5: ENSG00000115523_GNLY 1.091942e-19 6.029591 7.591871e-18
## ---
## 1317: ENSG00000216490_IFI30 1.000000e+00 NaN 1.000000e+00
## 1318: ENSG00000224397_SMIM25 3.076132e-01 NaN 3.797729e-01
## 1319: ENSG00000247982_LINC00926 1.134078e-01 NaN 1.800000e-01
## 1320: ENSG00000253701_ 1.000000e+00 NaN 1.000000e+00
## 1321: ENSG00000254709_IGLL5 5.895093e-01 NaN 6.478717e-01
Upon matrix factorization of the counts data, the top genes in a transcriptional state can be selected using the topRank function on the “gene.states” matrix:
factorize.matrix <- factorizeMatrix(model.pbmc, counts=pbmc_select)
top.genes <- topRank(factorize.matrix$proportions$gene.states, n = 25)
top.genes$names$L37
## [1] "ENSG00000163736_PPBP" "ENSG00000163737_PF4"
## [3] "ENSG00000180573_HIST1H2AC" "ENSG00000168497_SDPR"
## [5] "ENSG00000127920_GNG11" "ENSG00000102804_TSC22D1"
## [7] "ENSG00000101162_TUBB1" "ENSG00000010278_CD9"
## [9] "ENSG00000154146_NRGN" "ENSG00000120885_CLU"
## [11] "ENSG00000104267_CA2" "ENSG00000171611_PTCRA"
## [13] "ENSG00000113140_SPARC" "ENSG00000169704_GP9"
## [15] "ENSG00000101335_MYL9" "ENSG00000005961_ITGA2B"
## [17] "ENSG00000184113_CLDN5"
top.genes$names$L33
## [1] "ENSG00000163220_S100A9" "ENSG00000143546_S100A8"
## [3] "ENSG00000163221_S100A12" "ENSG00000162444_RBP7"
## [5] "ENSG00000110203_FOLR3"
We can make a heatmap of these top genes as follows:
top.genes.ix <- unique(unlist(top.genes$index))
norm.pbmc.counts <- normalizeCounts(pbmc_select)
renderCeldaHeatmap(norm.pbmc.counts[top.genes.ix,], z = model.pbmc$z,
y = model.pbmc$y[top.genes.ix], normalize = NULL,
color_scheme = "divergent")
While some transcriptional states contribute to the identification of cell subpopulations, some states have low statistical value to the mode, and users may find it useful to re-create the heatmap using only states that have a high enough Gini coefficient value to create a “cleaner” heatmap. For this example, we will use states that have a Gini coefficient higher than 0.3. This is an arbitrary cutoff, and users can set their own threshold for the cutoff.
filtered.tr.states <- gini$data$Transcriptional_States[gini$data$Gini_Coefficient > 0.3]
top.genes.filtered.ix <- unique(unlist(top.genes$index[as.numeric(levels(filtered.tr.states))][as.numeric(filtered.tr.states)]))
renderCeldaHeatmap(norm.pbmc.counts[top.genes.filtered.ix,], z = model.pbmc$z,
y = model.pbmc$y[top.genes.filtered.ix],
normalize = NULL, color_scheme = "divergent")
If there is a particular gene of interest, it is possible to check its transcriptional state with the lookupTranscriptionalStateofGene function.
lookupTranscriptionalStateofGene(counts = pbmc_select, model = model.pbmc,
gene = c("ENSG00000203747_FCGR3A",
"ENSG00000163736_PPBP"))
## $ENSG00000203747_FCGR3A
## [1] 39
##
## $ENSG00000163736_PPBP
## [1] 37
stateHeatmap creates a heatmap using only the genes from a specific transcriptional state, which enables the visualization of co-expression patterns of genes within the transcriptional state.
stateHeatmap(counts = pbmc_select, celda.mod = model.pbmc, state.use = 17)
Celda supports the use of tSNE (t-stochastic neighbor embedding) via the celdaTsne function. We can plot these tSNE results via plotDrCluster, plotDrState and plotDrGene. These will determine how each cell is labeled via celda, visualize each cell’s expression respectively of a specific transcriptional state and visualize each cell’s expression of a set of genes.
factorize.matrix <- factorizeMatrix(model.pbmc, counts=pbmc_select)
norm_pbmc <- normalizeCounts(factorize.matrix$counts$cell.states)
set.seed(123)
pbmc_tsne <- celdaTsne(counts = pbmc_select, celda.mod = model.pbmc, distance = "cosine")
With the use of the plots, it is also possible to determine which cell clusters express specific marker genes.
plotDrCluster(dim1 = pbmc_tsne[,1], dim2 = pbmc_tsne[,2], cluster = as.factor(model.pbmc$z))
plotDrState(dim1 = pbmc_tsne[,1], dim2 = pbmc_tsne[,2], matrix = factorize.matrix$proportions$cell.states, rescale = TRUE)
marker.genes <- c("ENSG00000168685_IL7R","ENSG00000198851_CD3E","ENSG00000105374_NKG7",
"ENSG00000203747_FCGR3A","ENSG00000090382_LYZ","ENSG00000179639_FCER1A",
"ENSG00000156738_MS4A1", "ENSG00000163736_PPBP","ENSG00000101439_CST3")
gene.counts <- pbmc_select[marker.genes,]
plotDrGene(dim1 = pbmc_tsne[,1],dim2 = pbmc_tsne[,2], matrix = gene.counts, rescale = TRUE)